arXiv: 1505.06318v4 [stat.ME] 11 Aug 2016 


Approximate maximum likelihood estimation using data-cloning ABC 


Umberto PicchinU’*, Rachele Anderson’^ 

“'Centre for Mathematical Sciences, Box 118 Lund University, SE-22100 Lund, Sweden 


Abstract 

A maximum likelihood methodology for a general class of models is presented, using an 
approximate Bayesian computation (ABC) approach. The typical target of ABC methods are 
models with intractable likelihoods, and we combine an ABC-MCMC sampler with so-called 
“data cloning” for maximum likelihood estimation. Accuracy of ABC methods relies on the use 
of a small threshold value for comparing simulations from the model and observed data. The 
proposed methodology shows how to use large threshold values, while the number of data-clones 
is increased to ease convergence towards an approximate maximum likelihood estimate. We 
show how to exploit the methodology to reduce the number of iterations of a standard ABC- 
MCMC algorithm and therefore reduce the computational effort, while obtaining reasonable 
point estimates. Simulation studies show the good performance of our approach on models 
with intractable likelihoods such as g-and-k distributions, stochastic differential equations and 
state-space models. 

Keywords: approximate Bayesian computation, intractable likelihood, MCMC, state-space 
model, stochastic differential equation 


1. Introduction 


We present a methodology for approximate maximum likelihood estimation that uses ap¬ 


proximate Bayesian computation (ABC, Tavare et ah, 1997, Pritchard et ah, 1999, Marjoram 


et ah, 2003). The method is applicable to a very general experimental setup, valid for both 
“static” and “dynamic” models. Our main question is: since in ABC studies artihcial datasets 
are produced from the data generating model and compared to the observed data according to 
a threshold parameter 5 (the smaller the 5 the better the inference), what can we do if we are 
unable to reduce 6 below a certain level? Or alternatively, can we perform inference using a 
relatively large 6 and fewer Markov chain Monte Carlo (MCMC) simulations, instead of progres¬ 
sively decrease 6 at the expense of using many MCMC iterations? Suppose we have obtained 
a very rough approximation to the posterior distribution for the unknowns, if we have at least 
located the main mode for the posterior can we conduct approximate maximum likelihood infer¬ 
ence? Basically by accepting of having obtained a poor approximation to the posterior, except 
for the location of its main mode, we switch to maximum likelihood estimation by proposing 
draws around such approximated mode using a special ABC-MCMC sampler. 

Let Y ~ f{y\',4') denote a realization from an observable random variable, i.e. /(•) is 
the data generating mechanism. Depending on the modelling scenario, Y might be observed 
conditionally on an unobservable random variable X, i.e. Y ~ /(y|A, (^), or X might be hxed 
and known (e.g. a set of deterministic inputs or covariates). In any case, there is dependence 
on an unknown (vector) parameter (f>. The statistical methods we are going to introduce have 
general appeal, however in order to set a working framework for the time being we assume to deal 


with state-space models (also known as Hidden Markov Models, Cappe et al., 2005). We remark 
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that our methods based on ABC are not restricted to state-space, nor dynamic models. For 
example, below we assume X a Markov process, but this is not a requirement nor is conditional 
independence of measurements. 

Consider an observable, discrete-time stochastic process {Yt}t>toj Yt ^ and a latent 
and unobserved continuous-time stochastic process {Xt]t>to, Xt £ X C Process Xt ~ 

g{xt\xt-i,r]) is assumed Markov with transition densities g{-) depending on another unknown 
(vector) parameter g. We think at {!*} as a measurement-error-corrupted version of {Xt} and 
assume that observations for {Yt} are conditionally independent given {Xt}. Our state-space 
model can be summarised as 

iyt ^ f{yt\Xt,(j)), t>to 
[Xt ~ g{xt\xt-i,g). 

Typically /(•) is a known function set by the modeller, whereas g{-) is often unknown (e.g. 
when {Xt} is a diffusion process, i.e. the solution of a stochastic differential equation) except 
for very simple toy models. Goal of our work is to estimate the parameters 6 = (g, 4>) using 
observations y = {yo,yi, ...,yn) from {Yt}t>to collected at discrete times {to,ti, ...,tn}. As an 
example, {Yt}t>to ™ay be defined as 

Yt,=Xt^+eg, eg^Pei(p), i = 0,l,...,n (2) 


with {et} representing unobservable noise sources (e.g. measurement errors) having distribution 
with probability density function (pdf) Pe(-). 

In Bayesian inference the goal is to analytically derive the posterior distribution vr(0|y) or, 
most frequently, implement an algorithm for sampling draws from the posterior distribution. 
Sampling procedures are often carried out using Markov chain Monte Carlo (MCMC) or Se¬ 
quential Monte Carlo (SMC) methods embedded in MCMC procedures (Andrieu et ah, 2010). 
However the last ten years have seen an explosion of methodological advances for the so-called 
approximate Bayesian computational methods. We propose to use an ABC-MCMC sampler 
as a “workhorse” to obtain an approximate MLE for 9. Notice that (approximate) Bayesian 
algorithms leading to an approximate MLE have also been considered in [Rubio and Johansen 
(2013). See Grazian and Liseo (2015) for ABC strategies for “integrated likelihoods”, where 
nuisance parameters are integrated out. 

The paper is organized as follows: in section 1.1 we briefly review some properties of “data 
cloning” for maximum likelihood estimation; in section we introduce topics of approximate 
Bayesian computation methodology, first by considering some basics (section |2.1| ) then introduc¬ 
ing our original contribution in sections 2.2f 2.3 Einally in section [^simulation studies illustrate 
results. 


1.1. Data cloning 


A Bayesian procedure for maximum likelihood estimation based on “replicating” data was 


first introduced in 

Robert 

(] 

993) (see also 

Robert and Titterington 

1998) and then studied 

under different flavors by e.g. 

Doucet et al. 

2002 

), Jacquier et al. 

(200 

7) and 

Lele et al. 

(2007). 


The latter introduced the term “data-cloning” which we employ. For simplicity, in the following 
we always consider the full vector parameter 0, and it should be understood that some of its 
components enter /(•) while others enter g{-). The likelihood function of 6 for the state-space 
model 0 can be written as 


L{9] y) = p{yo-, 9)Y\p{yi\yi, ...,yi-i] 9) 

i=l 

« n 

= / fiyo\xo-,0)p{xo) Ylifiyil Xi, 9)g{xi\xi-i,9)}dxo ■ ■ ■ dxr^ 


2 = 1 


( 3 ) 
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where xq = Xt^ and p(xo) the corresponding unconditional density. Actually in what follows, 
and without loss of generality, we assume xq deterministic and known, hence we remove p{xq) 
from the expression of the likelihood function. Notice the latter equality in Q exploits the notion 
of conditional independence between observations and the Markovian nature of the latent state. 
We now consider “cloning” the data y, i.e. we choose a positive integer K, produce K copies 
of y and stack them in = {y,y, ■■■,y) where y is replicated K times. Now, we generate K 
independent vectors for process {Xt}t>to from g{-) at times {to, •••jtn}, say X^^\X^‘^\ ...,X^^\ 
all simulated conditionally on the same value of 9. Therefore we set ...,xj:^^} for 

a generic k {k = 1, All vectors in y^^^ are assumed to be conditionally independent of 

each other, given their individual latent state X^^\ for example we could imagine a series of K 
independent experiments leading to exactly the same result. Then the likelihood function for 
the cloned data y^^^ results in 


L(0;yW) = 


w 



e)p{x^^^\9)}dx^^'^ ■ ■ ■ dx^^^ 


k=l 


(4) 


where we denote with I') the joint density of vector A hence p(A(^)|0) = __,g{xf\x\%e). 

Now, for each k a given term in the product in Q depends on A^^^ and not on terms having 
different indices k' {k' ^ k). Therefore 0 is a product of K integrals, each returning the 
likelihood function based on y, and we can write 


= n / f{y\x^^\e)p{x^^Mdx^^^ = {my)f. 


(5) 


Therefore the likelihood function for the cloned data is the likelihood based on the actual mea¬ 
surements raised to the power K. Now, it is clear that the MLE of 0 is the argmax for both 
L(9]y^^'^) and L{9;y). By considering a prior distribution '7r(0), we have the posterior distribu¬ 
tion resulting from a data-cloned likelihood 

7r(6'|y(^)) oc L{9-, y))'^7r(6'). (6) 


It is easy to prove that for a large enough K the mean of the posterior 7r(0|y(^)) approaches the 


MLF of 9 regardless the specific choice of tt{9) (|Lele et al. 

2007 

) and a central limit theorem 

can be derived (Jacquier et al. 

2007 

|Lele et al.[ 

2010 

). However, simulations experiments in 

Lele et al.j( 

2007) show that using informative priors 

enable a more rapid convergence to the MLF. 


In algorithm[^we consider a generalization of the data-cloning Metropoli s-Hastings algorit hm 


for sampling from 7r(0|y*'^^) (we call it “generalization” simply because in Lele et al.[ |2007| the 
proposal distribution for 9 is '/r(0), while we consider a general proposal u(-)). Consider a 
proposed value 9^ generated from a distribution having density u{9'^\9*) and a proposal 
generated from which we set for convenience to be 

The notation := means “assign the value on the right hand side to the left hand side”. For a 
large enough number of iterations R this algorithm produces a chain having 7r(0, A|y(^)) as its 
stationary distribution, with A = (A^^^,..., A^^^). In order to obtain draws from the desired 
marginal distribution 7r(0|y(^^) it is sufficient to discard the ..., obtained 

from the algorithm output {0, A^^),..., (after some appropriate burnin period). 

For A —7- oo the sample mean of the {9}j is the MLF of 9 and K times the covariance matrix 
of the draws returns the covariance of the MLF, the inverse of the Fisher information based on 
the original data (Jacquier et ah, 2007, Lele et al., |2010 ). Also, for A —)• oo and independently 
of the chosen prior, 7r(0|y^^^) is degenerate at 9 = 9, where 9 is the MLF of 9. 

Notice the simplification occurring in the expression for a, due to taking v{X\9) = p{X\9), 
this resulting in Q . The simplification above solves the typically difficult problem of not having 
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Algorithm 1 A data-cloning MCMC algorithm 


1. Initialization: Fix a starting value 6* or generate it from its prior 7r{0) and set 9i = 9*. Set 


j = 1- 

2. Generate K independent values of X, denoted from p{X\9*). 

3. Calculate 


K 


q* = llf{y\X<>^\ 9*). 

k=l 


4. Generate a ~ 10*). Generate independent from p(A|0^). Com¬ 

pute q* = 

5. Generate a uniform random variable cu ~ [/(0,1), and calculate the acceptance probability 


a = min 1 


q* ■p{X*W\9*)---p{X*(^')\9#) 
q* ■ p{X*W\9*) ■ ■ •p(A*(-^)l®*) 

'-V-' 

ratio of likelihoods 


v{X<^^\9*)---v{X<^')\9*)u{9*\9*) 

^ u(A#(i)|0#)... v{X*(^)\9*)u{9*\9*) ^ 

^-V-^ 

ratio of proposals 


Tr{9^) 

Tr{9*) 

ratio of priors 


= min 



n(6»*|6»#) 

u(e#\e*} 


7T{e#) 

n{e*) 


(7) 


If w > a, set 9j+i := 9j otherwise set 9j^i := 0^, 9* := 0^ and q* := Increase j by 1 and 
go to step 6. 

6. Repeat steps 4-5 as long as j < R for R “large”. 


a ready expression for the transition densities of {Xt}t>to- In fact, here all we need is the ability 
to (somehow) simulate the process {Xt}t>to^ nnd having access to transition densities is not 
strictly required. For example, in section [3^ we know the solution of the considered stochastic 
differential equation (SDE) model, so we can simulate from it. When the exact solution to 
an SDE is not available, a numerical discretization with stepsize h (e.g. the Euler-Maruyama 
scheme) generates an approximate solution, converging to the exact one as /i —)• 0. Beskos et al. 


(2006) even devised a numerical scheme resulting in exact simulation of the SDE solution (i.e. 
without discretization error), though this is of not so general applicability. 

It is important to realize that dealing with a “powered-up” posterior as in results in a sur¬ 
face having increasingly peaked modes for increasing K and deeper “valleys” in-between modes 
enclosing smaller and smaller probability mass (assuming the existence of multiple modes). 
Therefore we believe it is important not to let K fixed to a large value from the start of the 
algorithm, but instead start with a small value for K and then increase it progressively. En¬ 
abling a smooth and not too rapid increase of K should help the chain from being stuck in 
low-probability regions. However in the examples discussed in sections 3.1 3.3| a rapid increase 
in K is possible. 


2. Approximate inference using ABC with data-cloning 

Acceptance of proposals in MGMC algorithms is particularly challenging when the modelled 
process is highly erratic, for example when the unobserved state is a diffusion process, that 
is a solution to a stochastic differential equation (SDE, e.g. Euchs, 2013). Eor such class of 


models, trajectories for {A^} may result quite distant from the observed data y, even for values 
of the parameters in the bulk of their posterior distributions. In such circumstance will 
often be small compared to q*, and the proposal will rarely be accepted. For example, when 
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using an approach as the one described above, where trajectories are simulated “blindly” from 
p{X\-), that is unconditionally to data, then trajectories do not exploit direct knowledge of the 
data. This sometimes result in many rejected proposals if the sample size is large. Carefully 
tuned Sequential Monte Carlo (SMC) strategies can be constructed so that the best trajectories 
(“particles”) are selected according to their proximity to data, and this have pushed forward 


Bayesian inference via MCMC methods incorporating SMC (Andrieu et ah, 2010). 


However for complex (ideally multidimensional) stochastic models and a large number of 
observations, use of SMC methods is computer intensive. Approximate Bayesian computation 


(ABC, see reviews by Sisson and Fan, 2011 and Marin et ah, 2012) eases sampling from an 


approximation of the posterior distribution, by substituting likelihood function evaluations with 
simulations from the data generating model. Here follows a short discussion on ABC which will 
ease the introduction to our original contribution. 


2.1. Basics of ABC 

Here we summarize some minimal notions of ABC methodology, without considering for the 
moment the data-cloning scenario, hence in this section it can be assumed that K = 1. The 
ABC approach considers generating samples z from /(•) in Q (i.e. z G Y, same as the actual 
data) and corresponding proposals 6"^ are accepted if the z are “close” to data y, according to 
a threshold <5 > 0. Several criterion for “closeness” can be postulated, as described below. In 
ABC we aim at simulating draws from the augmented approximated posterior 


TTsie, z\y) oc Ji{y, z) L{6; z)tt{0) (8) 

'-V-" 

oc7r{0\z) 

where z = (zo,...,Zn) and L{0;z) is the (intractable) likelihood function for 0 based on z. 
Then Trs{0\y) oc f 7rs(0, zly)dz. Here Js(-) is some function that depends on S and weights 
the intractable posterior for simulated data t^{0\z) oc L{0; z)'k{0) with high values in regions 
where z and y are similar. Therefore we would like (i) Jsi-) to give higher rewards to proposals 
corresponding to z having values close to y. In addition (ii) Js{y, z) is assumed to be a constant 
when z = y (i.e. when <5 = 0) so that the exact marginal posterior 7r(0|y) is recovered. A 
common choice for Js{y, z) is the uniform kernel 


JsiVi ■2') OC ^{p(z,y)<S} 

where p{z, y) is some measure of closeness between y and z and I is the indicator function. 


Important alternatives are the Epanechnikov and Gaussian kernels (Beaumont 2010) or sums 


of discrepancies ( 

Toni et al. 

2009 

. An ABC-MCMC algorithm targeting the distribution (1^ has 

been proposed in 

Marjoram et al. 

(2003 

). However, one of the difficulties is that, in practice, S has 


to be set as a tradeoff between statistical accuracy (with a small positive 5) and computational 
feasibility (<5 not too small). Also, notice that ABC methods are most often applied to models 
where a set of low-dimensional summaries of the data S{y) is employed rather than the full 
dataset y. That is whenever is possible (and even more so when S'(-) is sufficient for 0), it is 
advisable to consider Js{S{y), S{z)) instead, so that for example we have 


7rs{0,z\p{S{z),S{y)) < 6) (x L{0] z)7r{0)I{p^s{z),S{y))<5} 
when Js{S{y),S{z)) oc I{p( 5 ( 2 ),s(y))< 5 }- 

The introduction of summaries S'(-) is a double edged sword. On one hand the specification 
of appropriate (i.e. informative, though usually not sufficient) statistics is not trivial, especially 
for dynamic models, whereas it is somehow more intuitive to specify them for static models. 
On the other hand having an informative set of statistics implies a considerable reduction in 
the number of elements to be compared {dg comparisons, where dg ■= dim(S') instead of the n 
comparisons required when simulated and observed data have to be compared, with dg <C n) 
and consequently a much smaller 6 can be employed. 
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2.2. ABC-DC: Data-cloning ABC 

In most problems (5 is a strictly positive value, sometimes set “small enough”, some other 
times set to a larger than desired value, depending on the complexity of the experimental 
scenario. In fact when using a very small 5 to obtain accurate inference, this results in a high 
rejection rate, often too high to be computationally feasible. Therefore we propose to consider 
a larger 6 than what would typically be considered as appropriate, coupled with data-cloning. 
Our idea is that if we use K = 1 while dynamically decrease 5 in an ABC-MCMC algorithm, to 
reach a moderately large <5-value that still allows exploration of the posterior surface (producing 
an acceptance rate of, say, 10-15%, and this phase could be considered as “burn-in”) we can then 
start a data-cloning procedure and progressively enlarge K while keeping 5 constant to its last 
value. During the initial exploration (burn-in with /C = 1) we require a 5 which is small enough 
to locate the approximate position of the maxima for the exact marginal posterior 7r(0|y), not 
a 5 producing an accurate approximation to the surface of 7r(0|y). When we start increasing K 
the marginal posterior will concentrate around its maxima, which for large enough 

K should be approximately at the same location as the MLE. 

More in detail we propose to consider R iterations of an ABC-MCMC algorithm later denoted 
as ABC-DC; (i) start an ABC-MCMC algorithm without data-cloning {K = 1) and let 5 decrease 
during this phase; an initially large 6 will enable a rapid exploration of the posterior surface at 
a high acceptance rate (say 30%) to locate the bulk of the approximated posterior, (ii) while 6 
progressively decreases, the algorithm focus on exploring a more accurate approximation of the 
posterior until 6 reaches say 10-20% acceptance rate. Such acceptance rate is typically too high 
in ABC studies (a typical value would be 1% or less) however we plan to increase the number 
of clones and focus on the peak of such posterior, (iii) At this point 5 is kept fixed to its last 
value and data-cloning starts, by progressively increasing the value of K. (iv) Once R iterations 
are completed, we collect the draws generated with the largest value of K and use those for 
maximum likelihood inference. As such the resulting samples are not from Trs{9,z\y) but from 
the powered-approximated posterior 7rs{9, ..., for finite K 

K 

7rsi9, z’'^\ ..., z^^'>\y^^'>) oc Tr{9) Js{y, z^^">)L{9] z^^">) 

k=l 


where the L{9;z^^^) can be simplihed out in the Metropolis-Hastings acceptance probability as 
previously illustrated. As mentioned in section 2.1, we assume as implicit the dependence on data 
via summary statistics, therefore here and in the rest of our work Js{y, z^^'>) = Js{S{y), S{z^^'))). 

Of course enlarging K shrinks the area of the support of the cloned posterior where most of 
the probability mass is located, hence it becomes increasingly difficult to explore a progressively 
peaked surface: this is why in step (ii) of the schedule above we do not recommend to go below, 
say, 10-20% acceptance as this rate will reduce drastically when K increases in step (iii). 

In our applications we use a Gaussian kernel, that is 


Js{y,z^’^'>) oc exp{-(5(z('^)) - 5(y))'D-i(5(z('=)) - S{y))/26^} (9) 

where ' denotes transposition. Clearly equation Q respects desired criteria, i.e. (i) it is constant 
when S{z^^'>) = S{y) and (ii) it gets larger values when S{z^^')) ~ S{y). For a generic 2 ; this 
implies writing S{z) ~ Nd^{S{y),6'^Q) with a ds-dimensional Gaussian distribution 

centred at S{y) and D a positive dehnite matrix. For simplicity we assume a diagonal D with 
elements D = diagja;^,..., }. Of course, using a diagonal Q might have an impact on the 

inference as it does not take into account the correlation among summary statistics. Recall 
that we keep writing Js{y,z) instead of Js{S{y), S{z)), the dependence on data via summary 
statistics being considered as implicit. 

When the elements in vector S'(-) are varying approximately on the same range of values it 
is possible to consider (wf, = (1,..., 1), however in general the variability of the statistics 


6 



is unknown and, depending on the type of data and the underlying model, these can have very 
different magnitude. An inappropriate choice for the elements in n affects the accuracy of the 
ABC inference negatively, with Jsi') being dominated by the most variable statistic so that 6 will 
bound the distance with respect to such statistic, not the remaining ones. To our knowledge 
the only systematic study on the weighting of summary statistics in ABC is Prangle (2015). 
When using summary statistics in our experiments, before starting the data cloning procedure 
we run a pilot study using ABC-MCMC (i.e. K = 1) with (wi,..., = (!,...,!) and collect 

the values of the accepted S{z). At the end of the pilot run we compute (after some appropriate 
burnin) the mean or the median absolute deviation MAD for each coordinate of the accepted 
S{z) and dehne (wi,..., := (MADi,..., MAD(i^). Then we plug the obtained D to weight 

the summary statistics into data-cloning ABC (introduced in sections 2.2-2.3) or in a further 
run of ABC-MCMC, for comparison purposes. See [Prangle (2015) for a thorough study on this 
approach. 

For ease of reading we produce two ABC-DC algorithms: the “static” ABC-DC is given in al¬ 
gorithmic where both 6 and K are assumed fixed, allowing for a more immediate understanding. 
However, in our applications we use algorithm which is discussed later. 


Algorithm 2 Static ABC-DC 

1. Initialization: Fix a starting value 9* or generate it from its prior Tr{6) and set 9i = 9*. 
Set j = 1, fix (5 > 0 and a positive integer K. A vector of statistics S{-) and weights D is 
available. 

2. Generate K independent values of the latent process houi p{X\9*). Condi¬ 
tionally on each generate a corresponding from equation (j^) for each k = 1, 

3. Calculate Js{y,z*^^^) for every k and compute 


q 


* 




k=l 


4. Generate a 0^ ~ u{9'^\9*). Generate K independent from p{X\9^) and corre¬ 

sponding 

5. Calculate Js{y, for every k and set = Ilk=iJ5iy,z*‘-’^'>). Generate w ~ t/(0,1), 

and calculate 


a = min 



u{9*\9*) 

u{9#\9*) 


7T{9*y 

tt{9*) 


If w > a, set := 9j and, increase j by 1 and go to step 6. Otherwise set 9j^i := 0^, 

9* := 0^, q* := increase j by 1 and go to 6. 

6. Repeat steps 4-5 as long as j < R. 


2.3. Dynamic ABC-DC 

In our experiments, and unlike in Lele et al. (2007) and Jacquier et al.| ( |2007[) where K is kept 
hxed during the MGMC algorithm execution, we let K increase (see also jPoucet et aT 2002). 
A “dynamic” version of ABC-DC with varying 5 and K is presented in algorithmic Notice that 
previously cited work did not use ABC with data-cloning, so to the best of our knowledge ours 
is the hrst work proposing doing so. 

As discussed by Christian P. Robert at http://xianblog.wordpress.eom/2010/09/22/ 
feedback-on-data-cloning/ data-cloning share features with the simulated annealing global 
optimization method: keeping K fixed to a high value once and for all removes the dynamic 
features of a simulated annealing random walk that first explores the whole space and then 
progressively focus on the highest modes, achieving convergence if the cooling is slow enough. 
In other words, if K is “large enough”, the Metropolis algorithm will face difficulties in the 
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exploration of the parameter space, and hence in the subsequent discovery of the global modes, 
while, if K is “too small”, there is no certainty that the algorithm will identify the right mode 
of possibly multiple modes. Here follow a “dynamic” ABC-DC algorithm, where schedules are 


defined for 5 and K, namely {5r^,5r2, 


,} and {Ks^,Ks 2 ,...,Ks„ 


} with (5ri > ••• > > 0 


and 1 = Ks^ < ... < This version of ABC-DC starts with si iterations of ABC-MCMC 
algorithm with decreasing thresholds, where si = K = I constantly throughout the 

Si iterations. The threshold is 5 := 5r^ in the hrst ri iterations, 5 := <5^2 in the next r 2 
iterations etc. In summary the first ri iterations use {S,K) := and in general during 

iterations {ri : n+i) we use {6,K) := (d^;,!). During the last iterations of ABC-MCMC 
we keep track of the maximum value maxlvr^^^(0|i/)} of the approximated posterior TTs^^iOly) 
and corresponding 9 := argmaxvr^^^ (0|y): this is easily accomplished and cheap to implement 
by initializing maxlvr^^^ (0|y)} := 0 just before setting {6,K) := Then, whenever we 

have Js{y, z^)tt{9^) > maxvr^^^ (0|i/) for the current maximised value of the posterior kernel, 
we set maxlvr^^^ (0|y)} := Js{y, z^)7r{9'^) and 6 := 6^. This search for the maximum has to 
be performed only when 5 = 6rp as we are not interested in the maximum obtained for poorer 
approximations of the posterior. An alternative approach to the search for a mode 6 is given by 
adjusting the output of ABC-MCMC as from Beaumont et al. (2002), hence step 5' denoted as 
“optional” in algorithm]^ this step is described in section 2.4 


Algorithm 3 Dynamic ABC-DC 
ABC-MCMC stage 

1. Initialization: Fix a starting value 9* or generate it from its prior tt{9) and set 9i = 9*. Set j := 1, 

6 := Sri := 0 . 

2. Generate X* from p{X\9*) and a corresponding z* from Q. Compute q* = Js{y,z*). 

3. Generate 0^ := AMRW(0*,Ej). Generate X'^’s from p{X\9'^) and corresponding . Compute 
q* = Js{y,z*). 

4. Generate w ~ C/(0,1), and calculate 


a = min 1, 





7r(0#)’ 


If w > a, set 0j+i := 9j otherwise set 0j+i := 0^, 9* = 0^ and q* := . If ^ = Sr^ go to 5. Otherwise 

increase j by 1 and if j S {r2, ■■.,rp\ update d := Sj. Then go to 3. 

5. Check current maximum (only when 5 = Srp)'- if q'^TT{9'^) > maxirs^^ set maxTr^^^ := q'^Tr{9'^) and 
9 := 9"^. Increase j by 1. If j < Si go to 3 else go to step 6. 

5h (optional) Apply regression adjustment on the draws obtained with Sr^. Take as 0 either the mean 
or the mode of the adjusted draws and call Eg, the covariance of the adjusted draws. 


Data-cloning stage 

6. Take the last accepted value 0*, 0 and the current covariance E^^ from ABC-MCMC. Set E/j := Eg,^, 
K := Ks 2 and S := Sr^- 

7. Generate K independent vectors denoted X*^^\X*^^'> from p(A|0*). Conditionally on each 
X*9^l generate a corresponding z*^^\ Calculate q* = Ilfci Js{y,z*^^^). 

8. Generate 0"^ := MIS(0, E^). Generate K independent from p(A|0'^) and corresponding 

2 :#(fe), Compute q* = Jl^i Js{y-,z*^^'l). 

9. Generate w ~ ?7(0,1) and calculate 


a = min 



X 


U2{9*\9, Efc) 

U2{9*\9,tk) 


X 


7r(0#)' 

7r(0*) 


If a; > a, set 0j_|_i := 9j otherwise set 0j+i := 0^, 9* := 0^ and q* := . Increase j by 1. If 

j G {s 3 ,..., Sq} increase K and update E^ := cdv{9)k> then go to 7, otherwise go to 10. Here c6v{9)k' is 
the sample covariance computed on draws generated using the previous value k' of K in the schedule. 

10. If j < i? go to 8 otherwise stop. 
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During ABC-MCMC we generate parameter proposals using adaptive Gaussian Metropolis 
random walk (Haario et al., 2001). We write 9^ := AMRW(0*,Sj) to denote such a proposal 
(and the corresponding Gaussian proposal function is denoted ui(-)). During the exploration 
of the approximate posterior surface with K = \ we aim at locating the principal mode 9 
of the distribution for the smallest threshold <5^^ as described above. At the end of the si 
iterations of ABG-MCMG we fix (5 := drp for the rest of the ABG-DC execution and increase K 
progressively. At this stage parameters are proposed using a Metropolis independent sampler 
MIS(0, Sfc) generating from the Gaussian distribution N{9, (and the corresponding Gaussian 
proposal function is denoted U 2 (-)) where is the sample covariance matrix obtained from draws 
generated using the previous value of K. 

We switched from a Gaussian random walk to an independent Metropolis sampler when 
powering-up because ABG-MCMG should have located the bulk of the approximated posterior 
(optionally with the help of the regression adjustment in section 2.4), as well as its highest mode, 
and therefore we use such information to propose parameters in the next stage corresponding to a 
larger K. Not following random walk dynamics helps in avoiding getting trapped in local modes 
that might emerge when powering-up the posterior surface for increasing K. This is why, in order 
to accommodate the reduced support of the current targeted distribution for increasing K, we 
recompute the covariance matrix E^ for the independence sampler at iterations j G {si,..., Sg}. 

Notice that at the end of step 9 when we reach an iteration j G {si, ...jSq} and K has to 
be enlarged, we need to “balance” the information contained in the numerator of a with the 
one in the denominator, and this is why we go back to step 7 instead of 8 and recompute q*. 
Basically after increasing K the number of clones in the numerator would be larger than the one 
in the denominator, hence the need to first go back to 7, where we recompute the denominator 
using the same value of K employed for the numerator. This re-evaluation is performed only 
when j G {si,..., Sq} and we can safely interpret this step as the starting point for a new chain, 
targeting the corresponding powered distribution. The effect of step 7 is removed after some 
burnin period (and a presumably short one, as at this point the chain should already be in 
the bulk of the powered posterior). In the end, from the inference point of view, what matters 
are the draws generated at the largest K, which are generated with a fixed 6 = 6rp, hence are 
genuinely distributed according to the corresponding (marginal) posterior TTs{9\y^^^). 

For complex models a slow transition between different number of clones Kg., might be 
required, meaning that differences Kg., — Kg.,^^ should not be too large. This is because the 
chain needs to adapt to a narrower support for the posterior when using Kg.,^_^ clones, and 
to ease the exploration we use the covariance obtained from draws generated with Kg., clones, 
which is hopefully appropriate. Otherwise if the difference above is too large the covariance used 
will be inappropriate (too large variances) and many proposals will be rejected. 

A maximum likelihood ABG algorithm has also been proposed in Rubio and Johansen (2013), 
however in that work a non-parametric kernel estimator of the posterior density is constructed 
(using draws from the ABG rejection algorithm proposed in Pritchard et al. 1999), then the 


maximizer of the non-parametric density is found numerically. In our approach we do not require 
any kernel estimation procedure (which is onerous unless the dimension of 9 is low), nor direct 


optimization procedures. An approach similar to the one by Rubio and Johansen is in Grazian 


and Liseo (2015). 


In conclusion, we use draws produced by algorithm under {S,K) = {6rp,Kg^) and for 
these draws we compute their sample mean 9^ x to obtain an approximate maximum likelihood 
estimate (MLE) of 9. In principle, if we were able to construct sufficient summaries S(-) for 9, 
and by letting J —)• 0 and then K —)> oo, we would have 9mle = 9s^k and '^mle = AT • ^s,K, 
where 9mle is the MLE of 9 and T^mle is the covariance of the MLE (i.e. the inverse of 
the Eisher information based on y) computed using the sample covariance derived under 
{5,K) = {6rp,Kg^). The reasoning behind these results, when S{-) is sufficient, is that (i) for 
an ABG approximation to a posterior it holds that in distribution lim 5 _ 5 .o'/r 5 ( 6 *|y) = 7 r( 0 |y), 
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and that (b) data-cloning implies that in distribution lim^^oo = N{9mle,^mle) 

Therefore by hrst taking the limit for 5 and then 


Jacquier et al. 

2007, 

Lele et al. 

2010 


applying the limit for K we have that 

(^5,K ~ ^{(^MLE, ^MLe), 


0,K 


oo. 


Then (when asymptotics holds) it would be easy to compute approximate standard errors and 
construct confidence intervals for the true value 9° of 9 (using the fact that 9mle when 
n —>• oo), by noting that J^mle is provided “for free” as detailed above. However, in reality our 
reasoning is based on not using a small <5, therefore the conhdence bounds resulting from using 
asymptotics are often wide, even though in our experiments we obtain good point approximations 
to the MLE. This is because 5 is large and an increasing K does not necessarily reduce the chain 
variability for all parameters. Therefore multiplying 'Ss^k by K might give too large standard 
errors (this fact holds regardless of whether we are able to make use of sufficient summary 
statistics, which is in general not the case). This is particularly true when summary statistics 
are not informative for a certain parameter, see the case of parameter logcr in Figure For all 
these reasons, focus of this work is on parameters point estimation. 

In conclusion, even if we use a 6 which is not small for accurate Bayesian inference, but small 
enough for locating the maximum of the posterior, then we can power-up the ABC posterior and 
propose samples around such maximum to obtain a good (point) approximation of the MLE. 
A formal study on the properties of the obtained estimators for positive 5 and finite K is not 
considered here and is left for future research. 


Finally note that in Lele et al. (2010) it is shown that a criterion to choose the number of 


clones is to monitor the decay (as K increases) of the largest eigenvalue Ai(iL) of the sample 
covariance matrix obtained from a chain using K clones. They used such criterion to diagnose 
parameters estimability, that is if Ai(iL) decreases to zero then the parameters are deemed es¬ 
timable, with a covariance matrix approaching degeneracy. The examples treated with standard 
data-cloning (i.e. papers not using ABC methodology) are sometimes able to consider clones in 
the order of hundreds ( Baghishani and Mohammadzadeh[ [2011 ) with a good acceptance rate, 
hence it makes sense to monitor the convergence of Xi{K). When using ABC we do not enjoy 
the luxury of letting an automatic criterion decide when to stop increasing K, as we are anyway 
bounded to use a much smaller number of clones with a small acceptance rate. 

2.4- Regression adjustment 


In section 3.1 we consider an example with a static model, where ABC inference is easily 
enabled thanks to the existence of intuitive and informative summary statistics (albeit not suffi¬ 
cient ones). However for dynamic models it is usually way more difficult to identify informative 
summaries. Therefore in section [3^ we automatically retrieve summaries using the method in 


Fearnhead and Prangle (2012). Although automatized construction of summaries is certainly a 


handy tool, since we plan to use a large threshold 5 at the end of the burnin {K = 1) phase it 
may be useful to “adjust” the obtained draws before starting the data-cloning phase, in order 


to obtain a more informative independence sampler. For the example in section 3.2 we consider 


the regression adjustment proposed in Beaumont et al. (2002). In this section for ease of writing 
we assume a scalar 9. Denote with 9s = {9i ,..., S^p) the sequence of draws for 9 produced at the 
smallest threshold 6 = Sr^ when K = I and denote with (S'!,..., Srp) the corresponding simulated 
summary statistics (each summary can be a vector). Consider the following regression model 

9i = a +{Si - Sy/3 + ^i, i = l,...,rp (10) 

where S is the summary statistic for the observed data, a and fd are regression parameters and 
is mean zero homoscedastic noise (see Blum et al.| 2013 for alternative approaches). Parameters 


10 























{a, /?) can be estimated via local linear regression by minimizing the following criterion 




i=l 


for some appropriate kernel Js (e.g. uniform, Gaussian kernel). Solution to the least squares 
problem is given by 

(aj) = {Z^WZy^Z^WOs 


Beaumont et al. 


2002 


where Z is the design matrix for model (|10|) and W a diagonal matrix with ith entry given by 
J5r„{S,Si) (see 


for details). The adjusted parameters are given by 


9* = ei- {Si - sfy 


i = 1, ...,rp. 


When the employed 5r„ is relatively large the posterior obtained from the adjusted draws 9* is 


usually more concentrated than the one based on 9i, see section 3.2 This means that, based on 


the set of 9*, we are able to construct a more informative empirical covariance matrix for the 
independence sampler used when K > 1. Also, we may consider taking the mean or median of 
the adjusted parameters and centre the independence sampler at such value {9 in algorithm]^. 


3. Simulation studies 

This section considers approximate inference for three simulation studies. The hrst two 
studies deal with observations from a ^-and-A: distribution and from a state space model respec¬ 
tively, both lacking explicit expressions for the likelihood function. The third one is based on a 
two-dimensional stochastic differential equation with correlated noise. For the latter it is possi¬ 
ble to write the exact likelihood function and therefore identify by numerical optimization the 
maximum likelihood estimate, which we compare with the ABC-DC estimator. In all examples 
whenever we refer to ABC-DC we mean the “dynamic ABC-DC” in algorithmic 

From the computer coding point of view, the three examples are easily vectorised (our codes 
are written in MATLAB) and therefore simulating model realizations for a certain A' > 1, say 5 < 
K < 15, did not result in any serious slow-down compared to using K < 5. However, consider 
having a computationally expensive model simulator such that producing a single realizations 
from the model requires several seconds or minutes. Assume therefore that running many (R) 
iterations of an ABC-MCMC algorithm for Bayesian inference is impractical, whereas running 
i? <C A iterations of ABC-DC over M > 1 processors is feasible (for simplicity, assume K 
a multiple of M). Here the task of computing the cloned likelihood would be performed by 
distributing K/M model simulations to each of the M processors. 

In our examples timing is obtained on simulations running on a i7-4790 CPU 3.60 GHz PC 
desktop. 


3.1. g-and-k distribution 

An interesting case study is given by ^r-and-A: distributions, first analysed via ABC methods 
Allingham et al. (2009). This is a flexibly shaped distribution that is used to model non- 


m 


standard data through a small number of parameters. It is defined by its inverse distribution 
function, but has no closed form density. The quantile function (inverse distribution function) 
is given by 


F ^{x]A,B,c,g,k) = A-\-B 


1 + c 


1 — exp(—S' • r{x)) 
1 -|- exp(—S' • r{x)) 


(1 -|- r‘^{x))^r{x) 


( 11 ) 


where r{x) is the xth standard normal quantile, A and B are location and scale parameters 
and g and k are related to skewness and kurtosis. We assume 9 = {A, B, g,k) as parameter of 
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Figure 1: g-k distribution: trace plots for the ABC-MCMC run. A major variability reduction occurs at iteration 
10,000 when reducing S from 0.3 to 0.1. Horizontal lines give the true parameter values. 


interest, given that we keep c fixed to c = 0.8 (Rayner and MacGillivray, 2002). Parameters 
restrictions are B > 0 and k > —0.5. An evaluation of (11) returns a draw (xth quantile) 
from the g-and-k distribntion or, in other words, the ith sample := rj(x) ~ A^( 0 , 1 ) produces 
a draw Zi := A, B, c, g, k) from the g-and-k distribution. Notice in this case there is no 

hidden/latent process, hence all simulated values Zi are independent draws from said distribntion. 


We follow the simulation setup for data (yi,..., of size n = 10^ generated as in Allingham 


et al. (2009) (see also Fearnhead and Prangle, [2012 ) with 6 = (3,1,2,0.5). They set uniform 


priors 17(0,10) on each parameter then used a standard ABC-MCMC algorithm (Marjoram 


et ah, 2003) to estimate parameters, with summaries S{y) = (y(i), y{n)) (the sequence of 
ordered data) and J{y,z) = (X]r=i[‘^*('^) “ ‘S'i(l/)]^)^'^^ with Si the ith element of S and 2 ; = 
{zi,...,Zn) a vector of samples from the y-and-A: distribntion. They obtain good inference for 
all parameters but g which is essentially unidentified. Actnally it is extremely simple to obtain 
accurate inference for all parameters by reducing the dimensionality of the problem nsing a 
smaller set of summaries. We set S{y) = (P 2 O)-^ 405 -PeO;-PsOj skew(y)), that is the 20-40-60-80th 
percentiles of the data and the sample skewness. As comparison function J 5 we consider a 
different criterion, namely a Caussian kernel as in ([^. With such setup, we first run a standard 
ABC-MCMC without data-cloning and let 5 decrease with schedule 6 G {5,3,1}. Results were 
not encouraging, and that’s because of using a matrix of weights with unit diagonal thns 
giving the same weight to each of the five summary statistics. Then we formed a new matrix of 
weights from the output of such preliminary (pilot) rnn, as described in section 2.1 to obtain 
[wi, ...,a; 5 ] = [0.22,0.19,0.53,2.96,1.90]. With the new 12 we rnn ABC-MCMC once more, this 
time with 5 G {0.3,0.1,0.05,0.015} where the largest value of S was used for the first 10,000 
iterations, then decreased every 10,000 iterations and the smallest value was used for the last 
20,000 iterations of overall R = 50,000 ABC-MCMC iterations, starting from parameter values 
(5,5,3, 2 ). The 50,000 iterations were completed in 103 seconds. Trace plots are in FigureAt 
the smallest 6 = 0.015 we obtain an acceptance rate of 1-2% and corresponding posterior means 
and 95% posterior intervals: A = 2.98 (2.97,2.99), B = 0.95 (0.92,0.98), g = 1.95 (1.82,2.07), 
k = 0.53 (0.50,0.56). 


12 










































5 



0 0.5 1 1.5 2 2.5 3 


A xIO^ 


1 

ii 1 





O' -^^^^^- 

0 0.5 1 1.5 2 2.5 3 


g xIO^ 


6 

4 

2 


0 

0 0.5 1 1.5 2 2.5 3 

B x10* 

2 

1.5 
1 

0.5 
0 

0 0.5 1 1.5 2 2.5 3 

k x10^ 




Figure 2: g-k distribution: trace plots for the ABC-DC run. Here <5 = 0.3 constantly and the variability 
reduction at iteration 7,000 is only due to K increasing from A = 1 to if = 15. Horizontal lines give the true 
parameter values. 


Considering a pilot run was doubly useful, because we can now use the determined O into 
ABC-DC. We start the algorithm at the same starting parameter values used in ABC-MCMC 
and keep the threshold fixed to a large value, that is 5 = 0.3 and in this case we do not let it 
decrease. Notice that 5 = 0.3 is the largest value used in the previous ABC-MCMC experiment. 
The first 7,000 iterations are run with K = 1, useful to identify a temporary main mode 9 
(see Figure]^, then we enlarge it to AT = 15 and during the next 20,000 iterations observe an 
acceptance rate of 1-2%. Therefore we ran in total R = 27,000 iterations which are completed 
in 402 seconds. Trace plots are in Figure Using {S, K) = (0.3,15) we obtain the following 
asymptotic means and standard errors from the large samples arguments outlined in section [2^ 
A = 2.99 (0.07), B = 0.98 (0.26), g = 1.97 (0.77), k = 0.48 (0.14). As previously remarked, 
in the present work we focus on point estimation and in this section confidence intervals based 
on asymptotics are reported only to show that these are likely to be overestimated when using 
ABC-DC, see the discussion below where we also consider a bootstrap approach. We considered 
the setup above for ABC-DC to perform a fair comparison with ABC-MCMC, namely an R 
long enough to return 20,000 draws (when K = 15) and a K large enough to return the same 
acceptance rate as in ABC-MCMC. Of course the setting is time consuming, however we can 
show how to obtain the same results using a quicker ABC-DC. We run 7,000 iterations with 
((5, K) = (0.3,1) then increase the number of clones to AT = 5 for further 5,000 iterations; overall 
time is 48 seconds and the acceptance rate is 10-12%. We obtain A = 2.98 (0.06), B = 0.97 
(0.23), g = 1.99 (0.74), k = 0.49 (0.16) essentially the same results obtained under the more 
expensive setup with computations an order of magnitude faster. For both ABC-DC simulations 
it is interesting to appreciate how the variability of the chain for parameter g (when 77 = 1), 
and the automatic identification of its maximum, allowed for a dramatic improvement in the 
mode identification for a larger K. See Figure]^ for the approximate marginal distribution of g 
based on A' = 1: ABC-DC automatically identifies the highest mode at about g = 2 and focuses 
on such mode for an increasing K. 

We conclude that ABC-DC returns very good point estimates, however standard errors are 
likely to be overestimated. To verify the latter claim we run a parametric bootstrap procedure 
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Figure 3: g-k distribution: marginal posterior for g when K = 1 and 5 = 0.3. At this point the current main 
mode 6 is automatically identified by ABC-DC and the algorithm will use it to propose samples under a larger 
value of K. 

Table 1: g-k distribution: means over 100 parametric bootstrap replications and 2.5-97.5% 
empirical percentiles using an exact MLE procedure and ABC-DC. For each replication using 
ABC-DC we considered {S,K) = (0.3,5). 



True values 

MLE 

ABC-DC 

A 

2.98 

2.99 [2.97,3.01] 

2.98 [2.95,3.01] 

B 

0.97 

0.98 [0.95,1.02] 

0.98 [0.91,1.04] 

9 

1.99 

1.97 [1.92,2.02] 

2.15 [1.89,2.70] 

k 

0.49 

0.49 [0.47,0.51] 

0.48 [0.41,0.58] 


of size 100: that is we produce 100 independent datasets using the parameters obtained under 
the more computationally conservative approach [A = 2.98, B = 0.97, g = 1.99, k = 0.49 
when K = 5 and 5 = 0.3) then re-estimate parameters on each simulated dataset via ABC-DC. 
For each simulation we used, again, 7,000 iterations with (6, K) = (0.3,1) followed by 5,000 
iterations with {d,K) = (0.3,5). Bootstrap results are in Tableshowing reasonable variation, 
whereas previously standard errors built on asymptotic arguments are overestimated. Finally, we 
perform a parametric bootstrap procedure employing an exact maximum likelihood estimation 
obtained by numerical optimization (Rayner and MacGillivray |2002[ ), using the R package 
in https://github.com/dennisprangle/gk, Results are in Table and we can once more 
appreciate the good approximation provided by ABC-DC. Thanks to this further simulation, 


we can definitely confirm that trying to apply the asymptotic considerations from section 2.3 
to compute standard errors (instead of, say, bootstrap procedures) results in inflated confidence 
intervals. 


3.2. Stochastic Gompertz model 

Here we consider a state-space model with stochastic Gompertz dynamics. [Donnet et al. 


(2010) and Ditlevsen and Samson (2013) used a hierarchical (mixed-effects) version of this 
model to study chicken growth. We do not use a mixed-effects model so our results cannot be 
directly compared to the cited references. We have the stochastic differential equation (SDE) 


dXt = BCe-^^Xtdt + aXtdWt, Xq = Ae 


-B 


( 12 ) 


with A, B, C and a unknown positive constants. It is easy to prove by Ito’s formula on the 


transformed process Zt = log(At) that (12) has explicit solution Xt = Ae ^ ^ 2 °" 

Xq = Ae~^. Same as in Donnet et al. (2010) and Ditlevsen and Samson (2013) we consider 


data on a logarithmic scale according to 

{yo,yi,-,yn) = (log(A) - R,log(Ai), ...,log(X„)) -be 
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Figure 4: Gompertz model: traceplots for log A, logC and logcr when using ABC-MCMC. Horizontal lines are 
parameter true values. 


where e ~ iVn+i(0, cr^/n+i) has (n + l)-dimensional multivariate Gaussian distribution (here 
In+i is the identity matrix). For simplicity we assume Xq known hence an estimate of either 
A or B can be determined from an estimate of the other parameter, e.g. B = \og{A/ Xq). In 
the following we choose to determine B. We assume known as this is a difficult parameter 
to identify without access to repeated measurements. Therefore unknowns are {A, C, a) and, in 
order to preserve positivity, in practice we conduct inference for 9 = (log log (7, log u). We 
set the following priors log^ ~ 17(1,15), logC ~ 17(0.5,4), a ~ LAf(0.1, 0.2). Here LN{a,b) 
denotes the log-Normal distribution with parameters (a, b) {a and b being the mean and standard 
deviation respectively of the associated Normal distribution). 

We simulate re+1 = 51 data points at equispaced observational times (to, ■■■An) = (0,1,..., 50) 
with parameters (logH, log H, log C, log cr, logUe) = (8.01,1.609,2.639,0, —1.609). In practice we 
normalize times to be in [0,1] for numerical stability. We wish to consider summary statistics to 
ease inference via ABC, however the determination of summaries for dynamic models is way less 


intuitive than for static models. We employ the regression approach suggested in Fearnhead and 


Prangle (2012) to determine a set of three summaries 5(•) = (S'! (•), S 2 (•), Ss (•)). Essentially their 
“semi-automatic ABC” is such that the vector S'(-) has the same dimension as 9, that is each 
element of 5(-) is supposed to be informative for a given component of 9. We do not illustrate 
their method here, but the reader can refer to Picchini| (2014) for an exposition targeting SDE 


models and to our MATLAB package (Picchini 2013) implementing the Eearnhead-Prangle 
method for the determination of S'(-) (but not implementing data-cloning). 

We first illustrate the results from an ABC-MCMC without data-cloning: we start at initial 
parameter values (log Aq, log Cq, log uq) = (11, 0.6, —2.3) and first run a pilot study to determine 
weights {uji,lo 2 ,uj 3 ) weighting the three automatically obtained summaries {Si, S 2 , S 3 ). Then 
we run ABC-MCMC once more using these weights and produce a total of i? = 40,000 draws. 
The first 8000 iterations use 5 = 20, then we decrease it to (5 = 5 for 7000 iterations and finally 
to 5 = 1.5 for the last 25000 iterations with a 1% acceptance rate at the smallest threshold. 
Automatic summaries construction together with the 40,000 ABC-MCMC iterations required 
about 55 seconds. Trace plots are in Eigurej^ Posterior means are log A = 7.52, log (7 = 2.58, 
logcj = —0.52. As we see in a moment the identification of a is difficult because we can’t really 
make use of an informative summary statistic. 

We now consider ABC-DC where AT = 1 is used for the first 10,000 iterations and we keep 
5 = 18 constant for the entire simulation, i.e. this 6 is about thirteen times larger than the 
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Figure 5: Gompertz model: ABC-DC with K = 1. Abscissas have values of simulated summary statistics and 
ordinates have the corresponding simulated parameters, labelled with gray circles (o). Black plusses (+) denote 
corresponding regularized parameters. 


smallest S used in ABC-MCMC. At the end of the 10,000 iterations we apply the regression 
adjustment described in section 2.4, This is trivially implemented and has negligible impact 


on the overall computational budget, amounting to solve a system of linear equations using 
draws already sampled at previous iterations, i.e. it is an almost instantaneous operation. 
In Figure circles denote pairs {6i,Si) where the 0* are draws generated with K = 1 (after 
burnin) and Si the corresponding three-dimensional simulated summary statistic, one for each 
dimension of 9i. Plusses denote the pairs {6^, Si), i.e. regression-adjusted parameters. We 
deduce that regularization enables an improved identihcation of A and C, which is of great 
help given the large value of 6 we are using. However the summary statistic used for a seems 
totally uninformative for the said parameter and in fact the regularization has no effect, see 
also Figure As explained in section 2.4 we can use the sample covariance from the adjusted 
draws to create a more effective independence sampler when K > 1. We employ this strategy 
here, and furthermore we center the independence sampler to the mean of the adjusted draws 6 * 
and execute 20,000 further iterations using K = 11 clones. Trace plots are in Figure Sample 
means computed on the last 20,000 draws returns point estimates log A = 7.88, logC = 2.71, 
logcj = —0.288 which are closer to the true parameter values than those obtained via ABC- 


MCMC. 

With K = 11 we obtain a 1% acceptance rate, for comparison with inference via ABC- 
MCMC. The entire algorithm (including the calculation of summary statistics and the regression 
adjustment) required about 48 seconds. This was possible thanks to a carefully vectorized 
MATLAB code so that simulating multiple instances of our model does not have any significant 
impact on the computational performance. Notice that without adjustment we are not able 
to jump from K = 1 straight to K = 11, as this would result in a extremely low acceptance 
rate. Finally, in order to show the robustness of the method, we perturb the parameters starting 
values and produce three different chains given in Figure all converging to the same values 
despite the large ABC tolerance and very different starting parameters. For example, in Figure 
[^we show the marginal posteriors for log A for the three chains resulting from K = 1 and before 
applying regression adjustment: as we can see the main modes are rather different. Despite this, 
for increasing K the three ABC-DC chains converge to about the same value, as we have just 
discussed. 

Our main comment is that we obtain a reasonable point estimate using ABC-DC without 
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Figure 6: Gompertz model: ABC-DC with K — 1. Kernel smoothed marginals for regression adjusted pa¬ 
rameters (solid lines) and from the original non-adjusted draws (dashed lines). For logcr the two marginals 
superimpose. 
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Figure 7: Gompertz model: traceplots for log A, logC and logn when using ABC-DC. Horizontal lines are 
parameter true values. 
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Figure 8: Gompertz model: three independent chains of ABC-DC starting at different values. 



Figure 9: Gompertz model: marginal posteriors without regression adjustment for each of three chains for log A 
resulting from iterations 7,000-10,000 in Figure [Sl that is from K = 1 and 5 = 18. See main text for comments. 
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having to reduce S too much, which should be particularly relevant for even more complex 
modelling scenarios, as commented at the beginning of section 


3.3. Two dimensional correlated Geometric Brownian motion 

In this section we apply ABC-DC to estimate the parameters of a two-dimensional SDE model 
having a likelihood function analytically available. It is therefore possible to compare results 
based on our methodology with exact maximum likelihood estimation. The model considered is 
a two-dimensional geometric Brownian motion or Black-Scholes model, the standard asset price 
model in finance. The process {Xt, is solution of the system of SDEs 


' dXt = niXtdt + aiXtdW^ 

dYt = ^i2Ytdt + a2YtdWi 


(13) 


with cri,(T 2 > 0, {VE/} and {Wf} correlated Wiener processes such that dW^ ■ dW^ = p ■ dt 
with a correlation coefficient \p\ < 1. We can explicitly write the two components of the exact 
solution for t G [0, oo) in terms of two independent Wiener processes {Bl'\ and as 


Xt = Xq exp \ { pi- -erf ) t + aiB} 


Yt = Yo exp 


P2 - t + 0-2 (^pBl + y/l-p'^B'^'j I 


where (Xo,lo) is the deterministic starting state of the process. By application of the multi¬ 
dimensional Ito formula we can derive an explicit expression for the conditional transition den¬ 
sities. It follows that for s £ [0,t), t G [0, oo), the transition density is 


p{s,Xs,ys-,t,xt,yt) = 


1 


2 TT{t - s)y^1 - p‘^aia2Xtyt 


exp^- 


(In(xt) - In(x^) - {pi - ^(Tt){t - s)y 

2cjf(t-s)(l-p2) 


(In(^t) - MVs) - {P 2 - - s)y^ 

2c7|(t-s)(l-p2) 




(In(xt) - ln(a;s) - {pi - gUi)(t - s)){ln{yt) - In(y^) - {p 2 - lcrl){t - s))p 

aia 2 {t - s)(l - p2) 


We assume observations generated from model (13), hence due to the Markovian property of the 


model solution it is possible to write the exact likelihood function for a discrete sample from the 
process as proportional to the product of its transition densities. MLEs are therefore obtained 
by maximizing numerically the resulting likelihood function, see Table 

We wish to conduct inference for the set of parameters 6 = (/Ui, In(cri),/i 2 , ln(cr 2 ), p) using 
500 equispaced observations taken on the time interval [0,1]. The starting state of the model 
is set to {Xo,Yq) = (1,2) and the true values of the parameters are pi = 1.7, Inui = —0.8, 
P 2 = 1.3, lnc72 = —1.2 , p = 0.3. The dimension of the problem creates difficulties in comparing 
data and pseudo-data, since to preserve a reasonable acceptance rate we would need to fix the 
threshold 6 to high values. However for this example we are able to identify sufficient statistics, 
resulting in much higher acceptance rates compared to using the entire dataset. Sufficiency 
implies that comparing summaries of data and pseudo-data is equivalent to comparing actual 
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and simulated data. We denote with {Mi^Vi,M 2 ,V 2 -,Ri-,R 2 ) the vector of sufficient statistics 
based on discrete observations {.^j}j=o,..,n = The statistics are 

n n 

J^(ln W - lnXi_i) = (InX, - InXo), Fi = - lnX,_i)2 

i=l i=l 

n n 

^(InFi - Iny^.i) = (lny„ - Inyo), ^2 = J^(lny* - Iny^.^^ 

i=l i=l 

n n 

J](lnW - lnXi_i)(lny, - Iny^.i), R2 = J]ln(Wyi). 

i=l i=l 

Since these statistics vary on different scales, we weight them using the estimated standard 
deviation dj of each statistic, obtained using a pilot run of ABC-MCMC considering 20,000 
iterations and using thresholds 6 = (1,0.9, 0.8), updated at iterations 7,000 and 14,000. We 
denote with S (z) = {Sj{z))j^-^ g the vector of statistics for a dataset z and with S{z*) the 
corresponding quantity for a simulated z*. By considering for Js {S{z), S{z*)) a Gaussian kernel 
n^i exp (—u/(2(i^)) /6, we weight summary statistics by writing u = D^T,D where = 

(^(^*(fc)) _ S{z))'^ and diagonal matrix S = diag (wi, with ujj = l/d|. 

We conduct inference on the logarithms of the two volatility parameters fii and 02 , since they 
are meant to be strictly positive. Moreover, the correlation parameter p should be in (—1,1), 
therefore we truncate its prior on this interval. We set priors N (1.5,0.5^) on the drift parameters 
Pi and p 2 , whereas the priors on Imri and lncj 2 are chosen to be N (—1,0.5^), and as prior on 
p we set a truncated Gaussian (0.5,0.3^) with subscript denoting the truncation at the 

specihed interval. The starting values for the parameters are set to 6q = (1.5, —1,1.5, —1,0.1). 
We compare the estimates obtained using ABC-MCMC and ABC-DC on a total of 100,000 
iterations. In both cases, the covariance for the adaptive Metropolis algorithm is updated every 
1,000 iterations. 

With ABC-MCMC the threshold 5 is updated every 10,000 iterations within the values 
(0.8, 0.5,0.4,0.3, 0.2), then after further 20,000 iterations is decreased to 5 = 0.2, and finally 
from iteration 60,000 to the end of the simulation the threshold is fixed at J = 0.15. During 
this last stage the acceptance ratio varies between 1 and 4%. After a total running time of 198 
seconds, we obtain the estimated posterior means 9 = (1.6426, —0.8495,1.2979, —1.1599,0.3860). 
Traceplots are given in Figure [TOl 

For comparison, similar settings are applied to estimate the model parameters with ABC- 
DC. The first 10,000 iterations are spent in the initial ABC-MCMC stage, where the threshold 

5 is updated once, at iteration 5,000, from 0.8 to 0.5. At the end of the ABC-MCMC stage 
the acceptance rate is about 15%. Then every 10,000 iterations the number of clones is in¬ 
creased progressively to 3, 5, 6, 7, and the final 40,000 iterations are run with K = 8. In 
this last stage the acceptance ratio is 1%. The simulation required 282 seconds, returning 

6 = (1.7065, —0.8695,1.3254, —1.1679,0.4075). However, good estimates with ABC-DC can be 
obtained with only 40,000 iterations, passing directly from 1 to 8 clones at iteration 10,000, 
without updating the threshold 6, with a final acceptance rate of 3.3%. In this case estimates 
are 9 = (1.7145, —0.8739,1.3378, —1.1804,0.4040), returned in only 112 seconds. Traceplots are 
shown in Figure [TT| A similar cut in the number of iterations failed with ABC-MCMC, since a 
slow reduction of the threshold 6 was required in order to preserve an acceptable mixing of the 
chains. Results are summarized in Table [2j 

The presented estimates are based on a single dataset generated from the set of parameters 
9, but in repetitions of the experiment ABC-DC has proved to be a robust method, as shown 
below. We now run B = 30 independent simulations: for each simulation a different dataset is 
generated using the same value for the true parameter 9 as considered in previous experiments. 
Then, each of these datasets is fitted using several algorithms, each algorithm returning the 


Ml = 

M2 = 

Ri = 
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True values 

Exact MLE 

ABC-MCMC 
K=\, (5=0.15 
100,000 iterations 

ABC-DC progressive 
A' = 8, 5 = 0.5 
100,000 iterations 

ABC-DC fast 

A' = 8, 5 = 0.8 
40,000 iterations 

Ml 

1.7 

1.7359 

1.6426 

1.7065 

1.7145 

In (Ti 

-0.8 

-0.8381 

-0.8495 

-0.8695 

-0.8739 

M2 

1.3 

1.2913 

1.2979 

1.3254 

1.3378 

In (T 2 

-1.2 

-1.1599 

-1.1599 

-1.1679 

-1.1804 

P 

0.3 

0.3742 

0.3860 

0.4075 

0.4040 


Table 2: True parameters, exact MLE, ABC-MCMC and ABC-DC estimates. Results comparably good can 
be obtained with ABC-DC in a bit more than half of the time required to ABC-MCMC. “ABC-DC progressive” 
considers the case where S is reduced and K is progressively increased. “ABC-DC fast” denotes a fixed S and a 
K increased rapidly. 


ABC-MCMC ABC-DC progressive ABC-DC fast 



Mean Bias 

RMSE 

Mean Bias 

RMSE 

Mean Bias 

RMSE 

Ml 

-0.0911 

0.2782 

-0.0674 

0.3605 

-0.0114 

0.3021 

Inui 

-0.0092 

0.0408 

-0.0134 

0.0435 

-0.0409 

0.0552 

M2 

0.0287 

0.2309 

-0.0244 

0.2241 

0.0021 

0.1727 

In (72 

0.0109 

0.0269 

0.0083 

0.0522 

-0.0093 

0.0317 

P 

-0.0039 

0.0423 

-0.0036 

0.0561 

0.0321 

0.0570 


Table 3: Mean bias and root mean square error for the parameters estimates on 30 different simulations. 


mean value of the proposed parameters. On the set of B estimates db we compute the mean bias 

Y2b=i0b — G)/B and the root mean square error (RMSE) \J'^b=i^^b — OYjB. Results obtained 
with ABC-MCMC and ABC-DC are comparable, see Table 

This example shows that both ABC-MCMC and ABC-DC can be applied with success to 
multi-dimensional SDE models when the likelihood function is unavailable, if we can identify 
a set of sufficient (or at least informative) summary statistics. The advantage of ABC-DC in 
this case is that we can obtain results comparably good in almost half of the time required for 
ABC-MCMC. 


4. Summary 


We have presented a strategy to integrate approximate Bayesian computation (ABC) in 
the so-called “data cloning” framework for approximate maximum likelihood estimation. A 
standard ABC-MCMC algorithm is initially used as a “workhorse” to locate an approximate 
maximum for the ABC posterior, which we use to center a Metropolis independent sampler, to 
be employed during the data-cloning stage. We note that the accuracy of the hnal inference, 
beyond mere identification of the location of the approximate MLE, can be enhanced for small 
5 and large K. However expecting our sampler to satisfy simultaneously both requirements is 
unrealistic as these are competitive criteria, particularly for highly erratic stochastic processes. 
That is, when considering an ABC framework it becomes increasingly unlikely to accept a 
proposed parameter as K increases. We note that for the considered examples ABC-DC produces 
reasonable inferences even though it is run with a larger 5 than typically desired. In previous 


works using data-cloning MCMC (Lele et ah, 2007, Baghishani and Mohammadzadeh, 2011 


Jacquier et al., 2007) the algorithms were started at a high value of TL, this opening the possibility 


for a chain to get stuck in some local maximum, instead we start the simulation with K = 1 and 


then we increase it. Eurthermore, in simulations considered in Baghishani and Mohammadzadeh 


(2011) the starting parameter for their MCMC experiments was set to the exact MLE, which 


will of course produce good results for K large since their MCMC procedure starts in a peaked 
distribution already centred at its maximum. 

Besides statistical inference considerations, a scenario where we could see our method being 
employed is when considering a computationally expensive model simulator such that producing 
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Figure 10: Geometric Brownian motion: (top) traceplots for /ii, logcii; (middle) ^ 2 , log 0-2 and 
(bottom) p when using ABC-MCMC and updating progressively the threshold 5 to the values 
(0.8,0.5,0.4,0.3,0.2,0.15). Horizontal lines are the true parameters. 




0 12 3 4 

P xIO* 


Figure 11: Geometric Brownian motion: (top) traceplots for pi, logui; (middle) p 2 , log<72 and 
(bottom) p when using ABC-DG and passing directly from 1 to 8 clones at iteration 10,000, 
without updating the threshold 5. Horizontal lines are the exact MLEs. 
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a single realizations from the model requires several seconds or minutes. Assume therefore that 
running many (R) iterations of ABC-MCMC is impractical, whereas running R R iterations 
of ABC-DC over M > 1 processors is feasible (for simplicity, assume K a multiple of M). Here 
the task of computing the cloned likelihood would be performed by distributing K/M model 
simulations to each of the M processors. 

A further application of the method can be envisaged when the experimenter is in need of 
reasonable and rapidly available estimates to be used as starting values for expensive procedures 
requiring a careful initialization, such as particle MCMC (pMCMC, Andrieu et al. (2010)). For 
example in Owen et al. (2015) it was found that using a specific ABC strategy (ABC-SMC) 
prior to starting pMCMC was of beneht, instead of investing a large amount of computational 
budget in trying to hand-tune pMCMC. 
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